BONN: Bayesian Optimized Binary Neural Network

[Figure: two Conv-BN-ReLU blocks of the network, annotated with the cross-entropy loss, the Bayesian feature loss, the Bayesian pruning loss, and the Bayesian kernel loss.]

FIGURE 3.20
By considering the prior distributions of the kernels and features in the Bayesian framework, we achieve three new Bayesian losses to optimize 1-bit CNNs. The Bayesian kernel loss improves the layerwise kernel distribution of each convolution layer, the Bayesian feature loss introduces intraclass compactness to alleviate the disturbance induced by the quantization process, and the Bayesian pruning loss centralizes channels following the same Gaussian distribution for pruning. The Bayesian feature loss is applied only to the fully connected layer.

learning are intrinsically inherited during model quantization and pruning. The proposed

losses can also comprehensively supervise the 1-bit CNN training process concerning kernel

and feature distributions. Finally, a new direction on 1-bit CNN pruning is explored further

to improve the compressed model’s applicability in practical applications.
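Concretely, this supervision amounts to adding the Bayesian terms to the task objective. The following is a rough sketch under assumed conventions: the helper `total_loss` and the trade-off weights `lam_k`, `lam_f`, and `lam_p` are hypothetical and only illustrate how a cross-entropy task loss could be combined with the Bayesian kernel, feature, and pruning losses shown in Fig. 3.20.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, targets, kernel_loss, feature_loss, pruning_loss,
               lam_k=1e-4, lam_f=1e-4, lam_p=1e-4):
    """Hypothetical composite objective: task loss plus the three Bayesian terms.

    logits, targets: network outputs and ground-truth labels for the task loss.
    kernel_loss, feature_loss, pruning_loss: scalar values of the Bayesian
        kernel, feature, and pruning losses computed elsewhere in the network.
    lam_*: illustrative trade-off weights (assumed, not taken from the text).
    """
    ce = F.cross_entropy(logits, targets)  # cross-entropy task loss
    return ce + lam_k * kernel_loss + lam_f * feature_loss + lam_p * pruning_loss
```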

3.7.1 Bayesian Formulation for Compact 1-Bit CNNs

The state-of-the-art methods [128, 199, 77] learn 1-bit CNNs by involving optimization in both continuous and discrete spaces. In particular, training a 1-bit CNN involves three steps: a forward pass, a backward pass, and a parameter update through gradient calculation. The binarized weights ($\hat{x}$) are only considered during the forward pass (inference) and gradient calculation. After updating the parameters, we obtain the full-precision weights ($x$). As revealed in [128, 199, 77], how to connect $\hat{x}$ with $x$ is the key to determining the performance of a quantized network. In this chapter, we propose to solve this problem in a probabilistic framework to learn optimal 1-bit CNNs.
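To make the three-step procedure concrete, below is a minimal PyTorch-style sketch, assuming a sign binarizer with a straight-through gradient estimator (a common choice; the text does not prescribe a specific estimator). The names `BinarizeSTE` and `BinaryConv2d` are illustrative: the binarized weights $\hat{x}$ enter the forward pass and gradient computation, while the optimizer updates the latent full-precision weights $x$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    """sign() in the forward pass, straight-through gradient in the backward pass
    (an assumed estimator, used here only to illustrate the training procedure)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)                       # x_hat = sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass the gradient only where the latent weight lies in [-1, 1].
        return grad_output * (x.abs() <= 1).float()

class BinaryConv2d(nn.Conv2d):
    """Convolution whose weights are binarized on the fly; the optimizer
    still sees and updates the full-precision weights x."""
    def forward(self, input):
        w_bin = BinarizeSTE.apply(self.weight)     # binarize for this forward pass
        return F.conv2d(input, w_bin, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```

Because binarization happens only inside the forward pass, the parameter update in the third step is applied to $x$ itself, which is exactly where the connection between $\hat{x}$ and $x$ must be modeled.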

3.7.2 Bayesian Learning Losses

Bayesian kernel loss: Given a network weight parameter $x$, its quantized code should be as close to its original (full-precision) code as possible, so that the quantization error is minimized. We then define:

$$
y = w^{-1} \circ \hat{x} - x, \qquad (3.97)
$$

where $x, \hat{x} \in \mathbb{R}^n$ are the full-precision and quantized vectors, respectively, $w \in \mathbb{R}^n$ denotes the learned vector to reconstruct $x$, $\circ$ represents the Hadamard product, and $y \sim G(0, \nu)$